Abstract

In recent years, the desire and need to understand sequential data has been increasing, with
particular interest in sequential contexts such as patient monitoring, understanding daily activities,
video surveillance, stock markets and the like. Along with the constant flow of data, it is critical to
classify and segment the observations on-the-fly, without being limited to a rigid number of classes.
In addition, the model needs to be capable of updating its parameters to comply with possible
evolutions. This interesting problem, however, is not adequately addressed in the literature, since many
studies focus on offline classification over a pre-defined class set. In this paper, we propose a
principled solution to this gap by introducing an adaptive online system based on Markov switching
models with hierarchical Dirichlet process priors. This infinite adaptive online approach is capable of
segmenting and classifying the sequential data over an unlimited number of classes, while meeting the
memory and delay constraints of streaming contexts. The model is further enhanced by introducing a
'learning rate', responsible for balancing the extent to which the model sustains its previous learning
(parameters) or adapts to the new streaming observations. Experimental results on several variants of
stationary and evolving synthetic data and two video datasets, TUM Assistive Kitchen and collated
Weizmann, show remarkable performance in segmentation and classification, particularly for evolutionary
sequences with changing distributions and/or containing new, unseen classes.


1 Introduction and related work 

The joint problem of time segmentation and recognition of sequential data into meaningful sub-sequences
has attracted significant research in a variety of domains. The ability to automatically segment and
classify data is a core technology for applications like speaker diarisation, finance, activity
understanding, multimedia annotation and human-computer interaction. To date, the main proposed
solutions have included sliding windows, the hidden Markov model (HMM), conditional random fields, and
structural SVM, covering the spectrum of generative, discriminative and maximum-margin dynamic
classifiers. Along with advancements in learning and inference, research has witnessed increasingly
realistic datasets which are bridging the gap between lab and real applications.

Nevertheless, important challenges such as model adaptation and dynamic class sets remain unresolved.
We address both these limitations by an adaptive online model that can accommodate an unlimited
(theoretically infinite) number of classes. In a nutshell, this is achieved by applying a Bayesian
non-parametric model, the hierarchical Dirichlet process (HDP), as the prior for a hidden Markov model
(a model known as HDP-HMM [8], [9]), and exploiting an adaptive learning rate for model adaptation. The
proposed model provides an adaptive online learning approach for joint segmentation and recognition of
sequential data with incremental class sets, and we refer to it as AdOn HDP-HMM in the following. The
model is i) online: it can receive sequential data in batches and segment and recognise them on-the-fly;
ii) adaptive: using a limited memory buffer, the model can tune its parameters in response to diverse
observations from the existing classes, as well as instantiate new, unseen classes, and it continues
learning throughout the entire life of its application; and iii) only-initially supervised: the model
uses a relatively short initial bootstrap of supervised training, but it adapts in a fully unsupervised
manner during its operation. The model also treats the stream as a one-pass process, without revisiting
past data. These constraints obviously make adaptation much more challenging, yet they make the model
suitable for a large span of real-life problems. To improve adaptation in such an unsupervised learning
scenario, we introduce the notion of a 'learning rate' that tunes how biased the model is towards its
previous learning (memory), versus adapting to the patterns conveyed by the new observations
(adaptability). Experiments support the efficiency of utilising a learning rate, particularly in
evolving scenarios.

The rest of this paper is organised as follows: in the rest of this Section we present the related
literature and clarify the scope of this study. In Section 2 we describe the hierarchical Dirichlet
process and its temporal extension, the HDP-HMM. Section 3 presents the proposed online approach,
expanding on the adaptive learning rate. Through the experiments and discussions in Section 4 we evaluate
and compare the proposed variants with existing benchmarks, and conclude in the final section.

1.1 Related work 

Amongst the many paradigms available for class modelling, hierarchical Bayesian modelling and, in
particular, the hierarchical Dirichlet process (HDP) offer a principled way to infer an arbitrary number
of classes from a set of samples via a hierarchy of prior distributions. The HDP is a Bayesian
nonparametric technique estimating the joint posterior distribution of a set of latent classes and a set
of parameters, typically by Gibbs sampling [10] or variational inference [11]. It has been used for a
variety of applications, including the modelling of sequential data, by integrating HDP priors into
state-space models such as the HMM. In the resulting HDP-HMM, the classes correspond to the discrete
states of a Markov chain and the data are explained by a state-conditional observation model. Given a
set of samples, classification is performed by state decoding, while allowing the number of states to
dynamically grow or shrink. The hierarchical Dirichlet process is finding increasing application in
domains as varied as bio-informatics, speaker diarization, vision and others for problems of joint
segmentation and classification.

Most of the segmentation and recognition studies in the literature follow an offline approach, where the
entire data set is presented at once during the learning stage. Such systems obviously do not suit the
needs of streaming data, which are ubiquitous in today's applications. In response to this increasing
demand for online systems, many studies are dedicated to this topic. However, the term online has been
given a variety of meanings in different contexts. Our interpretation is sequential processing of
temporal data in mini-batches, inspired by recursive Bayesian estimation and further elaborated
throughout this paper. This interpretation is distinct from that of other studies in the literature where
online refers to a closed dataset that is processed incrementally and possibly repeatedly, such as
Bayesian online nonparametrics, stochastic optimisation methods and formal bounds for online learning,
all based upon the foundations laid by earlier seminal works.

Although almost all the proposed approaches consider closed, pre-defined sets of classes, in scenarios
like long-term learning or monitoring the number of classes is not precisely predictable. Additionally,
as more data stream in, the known classes may change in parameters due to observing a more comprehensive
sample or a natural evolution over time. In either case, models are expected to update the parameters of
the known classes and add new classes to their vocabulary once they appear. Unsupervised adaptation can
be very challenging in non-stationary domains, where adaptation and drift^1 are hardly distinguishable.
To our knowledge, a frequent assumption in online studies is to avail of periodic or ad-hoc feedback from
the user (active learning). This feedback allows the model to evaluate the regret and redress possible
drifts and misclassifications. However, such information is hard or costly to obtain in many real
application domains.

^1 Defined as an undesirable deviation from the ideal model.

In the absence of expert feedback, we elaborate more on the learning rate as a dynamic lever for
balancing adaptability (Section 3.1). Most previous studies approach this problem by assigning constant
weights to prior learning and the likelihood of the current data. However, in more complex problems the
choice of the learning rate is highly dependent on the data dynamics and the application domain. Some
online studies propose adaptive learning rates via exponential decay and, more recently, regret-based
adaptations of the learning rate (i.e., the step size of gradient descent). However, such adaptation
strategies are only suitable for finite training sets. In our solution, we introduce a novel learning
rate that constantly adapts to the statistics of the streaming data, without revision or supervision.
For stationary problems where the parameters change only slightly, the learning rate tunes itself to rely
more on prior memory. Conversely, under evolving distributions the dynamics of data and their modes can
vary significantly, calling for a more adaptive model with less inertia to the past. Adding to the
complexity, many real-life problems require a mixture of both, i.e. a continuous spectrum for the
learning rate to follow more or less tightly the dynamics of observations at each point in time. In this
work, we tackle this problem by a posterior estimation of the learning rate separately for each parameter
in the model, thereby allowing each parameter to dynamically determine its adaptability in each batch.


2 The hierarchical Dirichlet process 

A Dirichlet process, $DP(\gamma, H)$, is a generative model that can be thought of as a distribution over
discrete distributions with countably infinite categories. It is controlled by a scalar parameter,
$\gamma$, known as the concentration parameter, and a base measure, $H$, over a measurable space
$\Theta$. A sample $G_0$ from a Dirichlet process is a distribution over $\Theta$ differing from zero at
only a countably infinite number of locations or atoms, $\theta_k$, $k = 1 \ldots K$:

$$
\begin{aligned}
G_0 &= \sum_{k=1}^{K} \beta_k\, \delta(\theta - \theta_k), \quad K \to \infty \\
\theta_k &\sim H, \quad \beta \sim GEM(\gamma)
\end{aligned}
\tag{1}
$$

The discrete set of locations is obtained by repeatedly sampling the base measure, while the weight for
each location, $\beta_k$, $k = 1 \ldots K$, is established by a stick-breaking process, noted as
$GEM(\gamma)$ (named after Griffiths, Engen and McCloskey) [25]. We refer to the weight vector simply as
$\beta$. A hierarchical Dirichlet process (HDP) consists of (at least) two layers of Dirichlet processes,
obtained with a similar construction:

$$
\begin{aligned}
G_j &\sim DP(\alpha, G_0), \quad G_0 \sim DP(\gamma, H) \\
G_j &= \sum_{k=1}^{K} \pi_{jk}\, \delta(\theta - \theta_k), \quad K \to \infty \\
\theta_k &\sim H, \quad \pi_j \sim DP(\alpha, \beta), \quad \beta \sim GEM(\gamma)
\end{aligned}
\tag{2}
$$


where $\gamma$ and $\alpha$ are the concentration parameters of the top-level and lower-level Dirichlet
processes, respectively. Since $G_0$ is discrete, the various $G_j$ ($j = 1 \ldots J$) are also discrete
and sampled from the elements of $G_0$ (Figure 1).

Figure 1: The Dirichlet Process (a) and Hierarchical Dirichlet Process (b) construction; the parameter space
has been simplified to one dimension for the sake of visualisation.
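
The stick-breaking construction of the weights in Equation 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `gem_weights`, the finite `truncation` level and the choice to fold the leftover stick mass into the last atom are not part of the paper, which works with a countably infinite number of atoms.

```python
import random


def gem_weights(gamma, truncation, rng):
    """Draw stick-breaking weights beta ~ GEM(gamma), truncated to a finite length.

    Each step breaks off a Beta(1, gamma) fraction of the stick that remains.
    The truncation is an illustrative assumption, not the paper's procedure.
    """
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, gamma)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # fold the leftover mass into the last atom
    return weights


rng = random.Random(1)
beta = gem_weights(gamma=2.0, truncation=10, rng=rng)
```

A smaller concentration parameter tends to place most of the mass on the first few atoms, while a larger one spreads it over more atoms.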



Figure 2: HDP-HMM graphical model. The box notation is used to show replication. 


In practical applications, the continuous space of distribution $H$ is taken to be the parameter space
for a data likelihood, as in $y \sim f(y \mid \theta)$, $\theta \sim H$. The likelihood
$f(y \mid \theta)$ could be, for instance, a Gaussian distribution whose parameters $\theta$ are sampled
from a Normal-Inverse-Wishart (NIW) distribution. Given the generative model of the HDP, the joint
distribution of data and parameters factorises as $f(y \mid \theta)\, G_j(\theta)$. Typically, multiple
$G_j$ are sampled to model data belonging to different groups. Yet, the hierarchical structure of the HDP
makes all the $G_j$ usefully share distributional properties. Examples can be as diverse as words in a
collection of books or genetic markers across different populations.

2.1 The HDP-HMM 

The HDP has also been used as prior distribution for the parameters of switching models such as the
hidden Markov model [8]. When applied to a Markov chain, $z_{1:T}$, with
$p(z_{1:T}) = p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1})$, the HDP changes its interpretation
significantly (Figure 2). In this case, each $\pi_j = \{\pi_{jk}\}$, $k = 1 \ldots K$, is used as one row
of the Markov chain's transition matrix, representing the probability of transitioning from state $j$ in
the previous time-step to any state in the current time-step, $p(z_t \mid z_{t-1} = j)$. Thanks to the
properties of the HDP, new states will be created when the data are not adequately explained by the
current set of states. In contrast to the conventional HDP, the group index, $j$, of each observation is
usually no longer known explicitly, but is instead inferred in sequential order from the chain.
Therefore, in the case of the HDP-HMM, $z_t \sim p(z_t \mid z_{t-1} = j) = \pi_j$ and
$y_t \sim f(y_t \mid \theta_{z_t})$. As a consequence, in the HDP-HMM the number of groups ($J$) and the
number of indices in each $\pi_j$ ($K$) coincide. Adding the HDP as prior caters for an arbitrary number
of states, or activity classes.

It is worth adding that a reported limitation of the HDP-HMM is the tendency to over-segment due to its
unbounded number of classes. Fox et al. have proposed adding a 'sticky' prior ($\kappa$) to the
transition matrix to emulate an inertia towards changing states, illustrated in Figure 2. We utilise the
sticky prior in this study, yet we still denote the model as HDP-HMM for brevity.

Figure 3: Adaptive online learning flowchart: initialised by a supervised bootstrap, the learning continues
unsupervised over streaming data split into batches. The figure shows a general case with batches of
variable size. For simplicity we assume all batches to have the same size (see [28] for the variable
alternative). The posterior $\phi$ in each batch is passed to the next batch as prior parameters.


2.2 Inference and Learning 

Inference and learning are typically performed simultaneously in the HDP and its extensions by estimating
the joint posterior distribution of the indicator variables, parameters, hidden variables and
hyper-priors conditioned on the observations. Deriving such an extensive joint posterior is analytically
intractable, hence it is mainly inferred using Gibbs sampling or variational inference. Gibbs sampling is
a simple yet effective method capable of estimating complex posteriors with significant accuracy, yet it
can converge slowly or remain permanently in a local optimum (poor mixing). Variational inference is
usually faster to compute; however, it requires prior derivation of analytical approximations and can
suffer from low accuracy due to the approximation. Contrary to the common presumption that Gibbs sampling
is inefficient, we will show how a brief initial supervised learning can result in rapid convergence to
accurate distributions.

Having inferred the class indicators, $z_{1:T}$, we proceed with translating the indices into meaningful
classes. In unsupervised learning, the correspondence between the ground-truth classes of data and the
labels assigned by the classification algorithm may not be obvious. In the case of the HDP, this problem
is exacerbated by the fact that the number of classes is undetermined. Therefore, to re-establish the
best possible one-to-one correspondence, the Hamming distance between ground-truth and assigned labels is
minimised by a greedy algorithm, matching labels in decreasing frequency order.
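
A minimal sketch of such a greedy matching is given below. The paper does not spell out the algorithm, so the details are assumptions: predicted labels are visited from most to least frequent, each is matched to the still-unused ground-truth label it overlaps with most, and unmatched predicted labels count as errors. The function names are hypothetical.

```python
from collections import Counter


def greedy_label_map(true_labels, pred_labels):
    """Greedily map predicted labels to ground-truth labels, most frequent first."""
    overlap = Counter(zip(pred_labels, true_labels))
    mapping, used = {}, set()
    for pred, _ in Counter(pred_labels).most_common():
        # Candidate ground-truth labels that are still free, with their overlap counts.
        candidates = [(count, true) for (p, true), count in overlap.items()
                      if p == pred and true not in used]
        if candidates:
            best = max(candidates, key=lambda c: c[0])[1]
            mapping[pred] = best
            used.add(best)
    return mapping


def hamming_distance(true_labels, pred_labels, mapping):
    """Count frames whose mapped prediction differs from the ground truth."""
    return sum(mapping.get(p) != t for t, p in zip(true_labels, pred_labels))


truth = ["a", "a", "a", "a", "b", "b", "c"]
decoded = [2, 2, 2, 2, 0, 0, 0]
label_map = greedy_label_map(truth, decoded)
errors = hamming_distance(truth, decoded, label_map)
```

Being greedy, this matching is not guaranteed to reach the global minimum of the Hamming distance; it trades optimality for simplicity.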


3 The Adaptive Online HDP-HMM 

The proposed AdOn HDP-HMM uses a supervised initialisation (bootstrap) of $T_b$ frames, followed by
the main unsupervised adaptive online inference (Figure 3). The extent of the supervised phase varies
with the application: in applications where annotation is easy, the bootstrap can be longer to provide a
more comprehensive training, while in domains with costly annotation the bootstrap will be brief. In
either case, during supervised learning, the indicator variables $z_{1:T_b}$ are fixed to their
ground-truth values, and the model's parameters are sampled for a given number of iterations to reach
convergence. After conclusion of the bootstrap phase, the data are processed in successive batches, and
the posterior probabilities of both indicator variables and parameters are estimated iteratively on each
batch.

Figure 4: Graphical model for the proposed adaptive online model. $\phi$ can be any of the parameters in
Figure 2 ($\theta$, $\pi$, $\beta$), $\tau$ is the respective learning rate (a positive continuous
variable) and $n$ represents the batch number.

Considering a generic stream of data, $y_{1:t}$, the posterior probability of the parameters can be
written as $p(\phi \mid y_{1:t}) \propto f(y_{1:t} \mid \phi)\, p(\phi)$, where $\phi$ indicates the
parameter vector of Figure 2. In the case of the HDP-HMM, the parameter vector is
$\phi = \{\theta, \pi, \beta\}$, where $\theta$ are the parameters of the emission densities, $\pi$ are
the transition probabilities (and weights of the lower-level DPs), and $\beta$ are the weights of the
higher-level DP. Further, since we assume normal densities, we have $\theta = \{\mu, \Sigma\}$, with
$\mu$ and $\Sigma$ the usual mean and covariance parameters. The online version leverages posterior
adaptation, using the posterior computed up to time $t$ as the prior for the next batch of data,
$y_{t+1:t+\Delta t}$:

$$
\begin{aligned}
p(\phi_{n+1} \mid y_{1:t+\Delta t}) &\propto f(y_{t+1:t+\Delta t} \mid \phi_n, y_{1:t})\, p(\phi_n \mid y_{1:t}) \\
&\approx f(y_{t+1:t+\Delta t} \mid \phi_n)\, \boldsymbol{p(\phi_n)}
\end{aligned}
\tag{3}
$$

where $n$ is the batch number (Figure 4). Given that the updated posterior embeds the distributional
properties of the observations up to the current time, observations $y_{1:t}$ in Equation 3 can be
discarded after adaptation. It implies that the accumulated sufficient statistics of previous data are
propagated parametrically and the non-parametric nature of the model is related to the inference method
of the current data batch. With that, the model carries all the prior learning and infers new labels using a limited memory
buffer. While this may come at a price of reduced accuracy, to our knowledge it is the only viable approach 
for unbounded streaming data. In contrast lIT^ presents online inference for latent Dirichlet allocation, 
yet over an unbounded buffer. Our work extends that model to infinite class sets while meeting the finite 
memory requirements of sequential data processing. 
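
The posterior-to-prior hand-over between batches can be illustrated on a deliberately simple conjugate case. The sketch below is an assumption-laden toy, not the paper's sampler: it tracks only the mean of a Normal with known unit variance, summarised by a mean and a pseudo-count, and shows that processing the stream batch by batch (discarding each batch afterwards) gives the same posterior as seeing all the data at once.

```python
def update_normal_mean(prior, batch):
    """Conjugate update for a Normal mean with known unit variance.

    `prior` is (mean, pseudo_count); the returned posterior has the same form,
    so it can be passed on as the prior for the next batch.
    """
    mean, count = prior
    new_count = count + len(batch)
    new_mean = (count * mean + sum(batch)) / new_count
    return new_mean, new_count


def process_stream(prior, batches):
    """One pass over the stream: each batch updates the posterior and is then dropped."""
    posterior = prior
    for batch in batches:
        posterior = update_normal_mean(posterior, batch)
    return posterior


online = process_stream((0.0, 1.0), [[1.0, 2.0], [3.0]])
offline = update_normal_mean((0.0, 1.0), [1.0, 2.0, 3.0])
```

In this toy case the two results coincide exactly; in the full model the hand-over is approximate, as Equation 3 indicates.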

3.1 Learning rate adaptation 

In the proposed adaptive system, a learning rate is applied over the prior and noted as $\tau$ in the
following. In each batch, $\tau$ is responsible for setting the weight of prior learning on the model's
parameters ($\theta$, $\pi$, $\beta$). In other words, our target is to balance the impact of the current
observations with the previous learning accumulated along the previous batches. This can augment or
weaken the posterior learning 'inertia' in 'adapting' to the current data (likelihood), as opposed to
retaining 'memory' (prior).

$$
p(\phi \mid y, \tau) \propto p(y \mid \phi)\, p(\phi)^{\tau}
\tag{4}
$$

It is worth noting that the length of the current batch, compared to the number of past samples, plays a
role in their relative influence on the posterior parameters (see Appendix A for more details).
Accordingly, $\tau$ can be articulated as a scaling factor on the number of 'pseudo-observations' in the
prior, balancing it against the respective number for the current batch^2.

For prior distributions belonging to the exponential family, this proposition does not violate Bayes'
Theorem, thanks to the properties of canonical parameters. Accordingly, we use exponential family
likelihoods and priors for easier integration of the learning rate into the model. Hereby, we focus on
the prior in Equation 3 (in bold font) and its hyperparameters, translating them into exponential family
notation. The standard parameters, $\theta$, are converted into the corresponding canonical parameters,
$\Theta$, and we make explicit their dependence on hyper-parameters, $\eta$:

$$
\begin{aligned}
p(\theta \mid \eta)^{\tau} = p(\Theta \mid \eta, \tau) &= f(\eta)^{\tau}\, g(\Theta)^{\tau} \exp\left(\Theta^{\top}\eta\right)^{\tau} \\
&= f(\eta)^{\tau} \exp\left(\tau\, \Theta'^{\top}\eta'\right), \quad \Theta' = [\ln g(\Theta);\ \Theta], \quad \eta' = [1;\ \eta]
\end{aligned}
\tag{5}
$$

^2 For convenience, in this paper we have constrained all batches to be of the same length and explored
the variable alternative in [28].

Adding the learning rate ($\tau$) as an exponent to this prior does not alter the type of distribution.
Rather, it updates the canonical parameters of the prior, ultimately affecting its weight in the
resulting posterior. Please note that we only need to derive a proportional posterior for sampling
purposes. Hence, the $\tau$ exponent on any term independent from $\Theta$ (such as $f(\eta)$) can be
ignored thanks to the proportionality. The normalisation coefficient $g(\Theta)^{\tau}$ can be merged
into the sufficient statistics, assuring that its $\tau$ exponent is absorbed into the scaled canonical
parameter ($\tau\eta$)^3.

In general terms, the posterior distribution of $\tau$ given $\theta$ in the presence of $N$ data samples
in $Y$ can be inferred as follows:

$$
p(\tau \mid \theta, Y, \eta) \propto p(\theta \mid \tau, Y, \eta)\, p(\tau)
\tag{6}
$$

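In our case, $\theta$ are the parameters of the HDP-HMM and their priors are a Normal-Inverse-Wishart
distribution for $\mu$ and $\Sigma$ and the HDP for $\pi$ and $\beta$. Given that both the NIW
distribution and the Dirichlet process are members of the exponential family, Equation 7 shows a unified
way of inferring posterior parameters in canonical form [29]:

$$
\begin{aligned}
p(\Theta \mid Y, \tau^*, \eta^*) &\propto p(Y \mid \Theta, \tau, \eta)\, p(\Theta \mid \eta, \tau) \\
&\propto \left[\left(\prod_{n=1}^{N} h(y_n)\right) g(\Theta)^{N} \exp\left(\Theta^{\top} \sum_{n=1}^{N} u(y_n)\right)\right]
\left[f(\eta, \tau)\, g(\Theta)^{\tau} \exp\left(\tau\, \Theta^{\top}\eta\right)\right]
\end{aligned}
$$

removing the constants with respect to $\Theta$:

$$
\begin{aligned}
p(\Theta \mid Y, \tau^*, \eta^*) &\propto g(\Theta)^{\tau + N} \exp\left(\Theta^{\top}\left(\sum_{n=1}^{N} u(y_n) + \tau\eta\right)\right) \\
\tau^* &= \tau + N, \quad \eta^* = \sum_{n=1}^{N} u(y_n) + \tau\eta
\end{aligned}
\tag{7}
$$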
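
To make the effect of the learning rate in Equation 7 concrete, the following sketch applies it to a toy conjugate case: the mean of a Normal with known unit variance, whose prior is summarised by a mean and a pseudo-count. The function name and the choice of scaling the prior pseudo-count by $\tau$ are illustrative assumptions that mirror the $\tau\eta$ term, not the paper's implementation.

```python
def tempered_update(prior_mean, prior_count, batch, tau):
    """Posterior mean and count when the prior's pseudo-observations are scaled by tau."""
    scaled_count = tau * prior_count          # tau * eta: re-weighted prior evidence
    post_count = scaled_count + len(batch)    # analogous to tau* = tau + N
    post_mean = (scaled_count * prior_mean + sum(batch)) / post_count
    return post_mean, post_count


batch = [10.0] * 10
neutral = tempered_update(0.0, 10.0, batch, tau=1.0)    # plain Bayesian update
adaptive = tempered_update(0.0, 10.0, batch, tau=0.1)   # follows the new data
inertial = tempered_update(0.0, 10.0, batch, tau=10.0)  # stays close to the prior
```

With the prior mean at 0 and the batch mean at 10, a small $\tau$ pulls the posterior mean towards the data, while a large $\tau$ keeps it near the prior.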

In the following sub-sections, we present the prior distribution of each parameter under the learning 
rate, and the posterior distribution of the corresponding learning rate. 

3.1.1 Inference of covariance matrix $\Sigma$

We infer $\mu$ and $\Sigma$ in the Normal-Inverse-Wishart prior by first sampling $\Sigma$ from an
Inverse-Wishart (IW) distribution, then using $\Sigma$ to sample $\mu$ from a Normal distribution [30].
The learning rate for $\Sigma$ is noted as $\tau_\Sigma$ in the text. Yet, to avoid cluttering the
notation in the equations, we simply note it as $\tau$ in Equation 9.

As mentioned earlier, the addition of a positive learning rate as exponent on the IW prior does not
alter the type of distribution and can be merged into the hyper-parameters. Below, we convert the
hyper-parameters $\phi_{iw} = \{\Psi, \nu\}$ into the natural form ($\eta$) to show the impact of $\tau$
more clearly. Ultimately, they are converted back to standard form ($\phi$) to show the linear
transformation caused by the learning rate.

^3 $g(\Theta)^{\tau} \exp(\tau\, \Theta^{\top}\eta) = \exp\left(\tau \ln g(\Theta) + \tau\, \Theta^{\top}\eta\right) = \exp\left(\tau\, [\ln g(\Theta);\ \Theta]^{\top} [1;\ \eta]\right)$

$$
\begin{aligned}
\phi_{iw} = (\Psi, \nu) \;&\rightarrow\; \eta_{iw} = \left(-\tfrac{1}{2}\Psi,\ -\tfrac{\nu + p + 1}{2}\right), \quad p = \text{number of dimensions} \\
\eta'_{iw} = \tau\,\eta_{iw} = \left(-\tfrac{\tau}{2}\Psi,\ -\tfrac{\tau(\nu + p + 1)}{2}\right) \;&\rightarrow\; \phi'_{iw} = \left(\tau\Psi,\ \tau(\nu + p + 1) - p - 1\right)
\end{aligned}
\tag{9}
$$
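
The linear transformation in Equation 9 is small enough to check numerically. The sketch below is illustrative only: the function name is hypothetical, the scale matrix is a plain list of lists, and the validity check uses the standard Inverse-Wishart requirement that the degrees of freedom exceed $p - 1$, which the paper does not discuss explicitly.

```python
def temper_inverse_wishart(psi, nu, tau):
    """Apply the learning rate tau to Inverse-Wishart hyper-parameters (Equation 9).

    Returns (tau * psi, tau * (nu + p + 1) - p - 1), where p is the dimension.
    """
    p = len(psi)
    tempered_psi = [[tau * value for value in row] for row in psi]
    tempered_nu = tau * (nu + p + 1) - p - 1
    if tempered_nu <= p - 1:
        # Standard IW constraint; a very small tau can violate it.
        raise ValueError("tempered degrees of freedom too small for a proper IW")
    return tempered_psi, tempered_nu


psi = [[2.0, 0.0], [0.0, 4.0]]
same_psi, same_nu = temper_inverse_wishart(psi, nu=5.0, tau=1.0)
strong_psi, strong_nu = temper_inverse_wishart(psi, nu=5.0, tau=2.0)
```

Setting $\tau = 1$ leaves the hyper-parameters unchanged, which is a quick sanity check on the transformation.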


Inference of $\tau_\Sigma$

To sample from the posterior, ideally we would like to consider a conjugate prior that analytically
derives the posterior hyper-parameters, given those of the prior and the sufficient statistics of the
current data. A candidate conjugate prior for the IW distribution is the Gamma. However, the
Inverse-Wishart is only conjugate to the Gamma as the prior for the scale parameter (or a scaling
coefficient for the scale parameter, in the multivariate case). Hence, a Gamma cannot be used as a
conjugate prior for deriving the posterior of $\tau_\Sigma$ in a maximum-a-posteriori solution (Appendix
B presents the proof).

Therefore, we utilise a maximum-likelihood solution to derive the posterior hyper-parameters for
$\tau_\Sigma$. The posterior for $\tau_\Sigma$ is modelled using an Inverse-Gamma (IG) distribution, the
univariate correspondent of the Inverse-Wishart. The samples of the IG are positive real values, suitable
for the scalar learning rate $\tau_\Sigma$. The distributions are displayed below.




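$$
\begin{aligned}
IW(\Sigma \mid \Psi, \nu) &= \frac{|\Psi|^{\nu/2}}{2^{\nu p/2}\, \Gamma_p(\nu/2)}\, |\Sigma|^{-\frac{\nu + p + 1}{2}} \exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\Psi\Sigma^{-1}\right)\right) \\
\text{univariate IW } (p = 1):\quad IW(\sigma \mid \psi, \nu) &= \frac{\psi^{\nu/2}}{2^{\nu/2}\, \Gamma(\nu/2)}\, \sigma^{-\left(\frac{\nu}{2} + 1\right)} \exp\left(-\frac{\psi}{2\sigma}\right) \\
IG(\tau \mid \beta, \alpha) &= \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \tau^{-\alpha - 1} \exp\left(-\frac{\beta}{\tau}\right)
\end{aligned}
\tag{10}
$$

Comparing the univariate IW and IG in Equation 10, we can derive the posterior parameters as:

$$
IG(\tau \mid \beta^*, \alpha^*) \approx IW(\Sigma \mid \Psi, \nu) \quad \text{where} \quad
\begin{cases}
\beta^* = f(\Psi)/2 \\
\alpha^* = \nu/2
\end{cases}
\tag{11}
$$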


As can be seen, the hyper-parameters in the Inverse-Wishart map to those of the Inverse-Gamma. The
only issue is how to best map the $p \times p$ scale matrix ($\Psi$) into the scalar value $\beta^*$
through $f(\Psi)$. We propose to use the largest eigenvalue of $\Psi$ as the scale parameter in the
Inverse-Gamma posterior. Approximating a covariance matrix via its first principal components is a
meaningful and common approach [31], [32]. Another choice for $f(\Psi)$ could be the determinant. The
determinant is used as the scale associated with a square matrix since it is equal to the product of all
its eigenvalues. While it gives a more thorough account of all the eigenvalues, it becomes unsuitable
when the dimensionality is high and many of the eigenvalues are close or equal to zero. Moreover,
calculating the determinant of a high-dimensional matrix is very costly in an online context. Therefore,
the first approach is generally preferable.

So far, we have established a way to infer $\tau_\Sigma$ from a single IW distribution belonging to a
single state. Considering the proposed model with infinite states, we need to merge the IW parameters for
all classes to infer $\tau_\Sigma$. This is done through a weighted average of $\Psi_k$,
$k = 1 \ldots K$, where the weights are the frequency of observations for each state, i.e. the degrees of
freedom in the IW parameters ($\nu_k$).

$$
IG(\tau \mid \beta^*, \alpha^*), \quad
\beta^* = \frac{\sum_{k=1}^{K} \max\left(\operatorname{eig}(\Psi_k)\right) \nu_k}{2 \sum_{k=1}^{K} \nu_k}, \quad
\alpha^* = \frac{\sum_{k=1}^{K} \nu_k}{2K}
\tag{12}
$$
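
The scale parameter $\beta^*$ described above (largest eigenvalue of each $\Psi_k$, averaged with weights $\nu_k$ and halved) can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the largest eigenvalue is obtained with a basic power iteration that assumes a symmetric positive semi-definite matrix and a starting vector not orthogonal to the dominant eigenvector, and only $\beta^*$ is computed.

```python
import math


def largest_eigenvalue(matrix, iters=200):
    """Approximate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    n = len(matrix)
    vec = [1.0 + i for i in range(n)]  # non-uniform start vector (an arbitrary choice)
    for _ in range(iters):
        prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in prod))
        if norm == 0.0:
            return 0.0
        vec = [x / norm for x in prod]
    prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(vec, prod))  # Rayleigh quotient of a unit vector


def inverse_gamma_scale(scale_matrices, dofs):
    """Weighted average of the largest eigenvalues (weights nu_k), divided by two."""
    weighted = sum(largest_eigenvalue(psi) * nu for psi, nu in zip(scale_matrices, dofs))
    return weighted / (2.0 * sum(dofs))


psis = [[[2.0, 0.0], [0.0, 1.0]], [[4.0, 0.0], [0.0, 3.0]]]
beta_star = inverse_gamma_scale(psis, dofs=[1.0, 3.0])
```

In practice a linear algebra library would normally be used for the eigenvalue; the power iteration here only keeps the sketch free of dependencies.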


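3.1.2 Inference of mean $\mu$

Having inferred $\Sigma$, the next step is to derive the multivariate mean, $\mu$, in the NIW prior. Let
us consider a generic multivariate Normal distribution $\mathcal{N}(\mu \mid \mu_0, \Sigma)$, with known
covariance. To observe the impact of the learning rate, we convert its parameters
$\phi_N = (\mu_0, \Sigma)$ into the natural form and multiply them by the learning rate $\tau_\mu$, and
ultimately revert them back to the standard format:

$$
\begin{aligned}
\phi_N = (\mu_0, \Sigma) \;&\rightarrow\; \eta_N = \left(\Sigma^{-1}\mu_0,\ -\tfrac{1}{2}\Sigma^{-1}\right) \\
\eta'_N = \tau_\mu\, \eta_N = \left(\tau_\mu \Sigma^{-1}\mu_0,\ -\tfrac{\tau_\mu}{2}\Sigma^{-1}\right) \;&\rightarrow\;
\mathcal{N}(\mu \mid \phi'_N) = \mathcal{N}\left(\mu \,\middle|\, \mu_0,\ \tfrac{1}{\tau_\mu}\Sigma\right)
\end{aligned}
\tag{13}
$$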



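Inference of $\tau_\mu$

Posterior sampling of $\tau_\mu$ is conducted with a similar approach to $\tau_\Sigma$, but using a Gamma
conjugate prior. This time the Gamma prior is conjugate by definition, since its sample is utilised
merely as a scaling coefficient for the covariance. The detailed proof of the conjugacy is provided in
Appendix C. Similarly to $\tau_\Sigma$, the weighted average of sufficient statistics across all classes
is used to infer $\tau_\mu$.

$$
\begin{aligned}
G(\tau \mid \alpha^*, \beta^*) &\propto \mathcal{N}\left(\mu \,\middle|\, \mu_0,\ \tfrac{1}{\tau}\Sigma\right) G(\tau \mid \alpha, \beta) \\
\beta^* &= \beta + \frac{\sum_{k=1}^{K} (\mu_k - \mu_{0k})^{\top} \Sigma_k^{-1} (\mu_k - \mu_{0k})\, \nu_k}{2 \sum_{k=1}^{K} \nu_k}, \quad
\alpha^* = \alpha + 1/2
\end{aligned}
\tag{14}
$$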
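
The Gamma conjugacy used for $\tau_\mu$ can be sketched in the simplest setting. The code below is an illustration under explicit assumptions: a single class, a univariate mean, and a Gamma distribution parameterised by shape and rate; the function names are hypothetical and the weighted averaging across classes is omitted.

```python
import random


def gamma_posterior_for_rate(mu, mu0, variance, alpha, beta):
    """Posterior Gamma(shape, rate) of tau when mu ~ N(mu0, variance / tau).

    Univariate, single-class case: shape gains 1/2 and the rate gains half the
    squared, variance-normalised distance between mu and mu0.
    """
    alpha_star = alpha + 0.5
    beta_star = beta + (mu - mu0) ** 2 / (2.0 * variance)
    return alpha_star, beta_star


def sample_rate(alpha_star, beta_star, rng):
    """Draw tau from Gamma(shape, rate); random.gammavariate expects a scale."""
    return rng.gammavariate(alpha_star, 1.0 / beta_star)


alpha_star, beta_star = gamma_posterior_for_rate(mu=3.0, mu0=1.0, variance=2.0,
                                                 alpha=2.0, beta=1.0)
tau_mu = sample_rate(alpha_star, beta_star, random.Random(0))
```

Under these assumptions, a mean that has moved far from its prior location increases the rate parameter, so the sampled learning rate tends to be smaller and the prior is weighted less.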


3.1.3 Inference of the HDP transition parameters 

Thus far we have discussed the adaptation of the learning rate for emission parameters. The other main
set of parameters in our AdOn HDP-HMM are the HDP's $\beta$ and $\pi$ parameters, which jointly and
hierarchically cater for the transition probabilities. The distributions of these parameters are shown in
Equation 15, where $m$ and $n$ are HDP sufficient statistics representing the frequency of occurrence in
each class:

$$
\begin{aligned}
\beta &\sim Dir\left(\gamma/L + m_{\cdot 1}, \ldots, \gamma/L + m_{\cdot L}\right) \\
\pi_j &\sim Dir\left(\alpha\beta_1 + n_{j1}, \ldots, \alpha\beta_j + \kappa + n_{jj}, \ldots, \alpha\beta_L + n_{jL}\right)
\end{aligned}
\tag{15}
$$

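Similarly to the previous parameters, we illustrate the impact of the learning rates, $\tau_\beta$ and
$\tau_\pi$, on the hyper-parameters of the above Dirichlet distributions in standard form and infer the
posterior samples for these learning rates:

Inference of the learning rate for $\beta$

$$
\begin{aligned}
\phi_\beta = \left(\gamma/L + m_{\cdot 1}, \ldots, \gamma/L + m_{\cdot L}\right) \;&\rightarrow\;
\eta_\beta = \left(\gamma/L + m_{\cdot 1} - 1, \ldots, \gamma/L + m_{\cdot L} - 1\right) \\
\eta'_\beta = \tau_\beta\, \eta_\beta &= \left(\tau_\beta(\gamma/L + m_{\cdot 1}) - \tau_\beta, \ldots, \tau_\beta(\gamma/L + m_{\cdot L}) - \tau_\beta\right) \\
\beta &\sim Dir\left(\tau_\beta(\gamma/L + m_{\cdot 1} - 1) + 1, \ldots, \tau_\beta(\gamma/L + m_{\cdot L} - 1) + 1\right)
\end{aligned}
$$

Inference of the learning rate for $\pi$

$$
\begin{aligned}
\phi_\pi &= \left(\alpha\beta_1 + n_{j1}, \ldots, \alpha\beta_j + \kappa + n_{jj}, \ldots, \alpha\beta_L + n_{jL}\right) \\
\rightarrow\; \eta'_\pi &= \left(\tau_\pi(\alpha\beta_1 + n_{j1}) - \tau_\pi, \ldots, \tau_\pi(\alpha\beta_j + \kappa + n_{jj}) - \tau_\pi, \ldots, \tau_\pi(\alpha\beta_L + n_{jL}) - \tau_\pi\right) \\
\pi_j &\sim Dir\left(\tau_\pi(\alpha\beta_1 + n_{j1} - 1) + 1, \ldots, \tau_\pi(\alpha\beta_j + \kappa + n_{jj} - 1) + 1, \ldots, \tau_\pi(\alpha\beta_L + n_{jL} - 1) + 1\right)
\end{aligned}
\tag{16}
$$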
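
The mapping of Dirichlet hyper-parameters under the learning rate in Equation 16 amounts to $a \mapsto \tau(a - 1) + 1$ for each concentration $a$. A minimal sketch follows; the function name is hypothetical, and the positivity check is an added safeguard rather than something stated in the paper.

```python
def temper_dirichlet(params, tau):
    """Apply the learning rate to Dirichlet concentrations: a -> tau * (a - 1) + 1."""
    tempered = [tau * (a - 1.0) + 1.0 for a in params]
    if min(tempered) <= 0.0:
        # Concentrations below 1 combined with a large tau can leave the valid range.
        raise ValueError("tempered Dirichlet parameters must stay positive")
    return tempered


counts_based = [3.0, 5.0]  # e.g. gamma / L + m for two classes (illustrative values)
unchanged = temper_dirichlet(counts_based, tau=1.0)
flattened = temper_dirichlet(counts_based, tau=0.5)
```

A learning rate below one shrinks the concentrations towards a flat Dirichlet, which weakens the influence of the accumulated counts.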

Inference of $\tau_\beta$ and $\tau_\pi$

To the best of our knowledge, there are no conjugate priors over a scaling factor for the parameters of
a Dirichlet distribution in the presence of an intercept. Hence, we estimate the next batch's learning
rate using a Metropolis-Hastings (MH) jump. This approach is used in several other studies and is a valid
MCMC move. For the MH step, one can choose a suitable candidate function, and the samples are accepted
with probability of acceptance $\rho \propto \min(1, \cdot)$.

To sample $\tau_\beta$, we have selected the candidate function as $G(\tau_\beta \mid \alpha, \beta)$,
the prior over the learning rate. The new sample ($\tau^*_\beta$) is accepted with the probability in
Equation 17, updating $\tau_\beta$ for the current batch with the accepted sample. An identical approach
can be taken for $\tau_\pi$ by replacing $\beta$ with $\pi_j$ in Equation 17. The $\beta$ subscripts in
$\tau_\beta$ are removed to avoid notational clutter.


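$$
\begin{aligned}
\rho(\tau \rightarrow \tau^*) &\propto \min\left(1,\ \frac{p(\tau^* \mid \alpha, \beta)\, Q(\tau \mid \tau^*)}{p(\tau \mid \alpha, \beta)\, Q(\tau^* \mid \tau)}\right) \\
&= \min\left(1,\ \frac{Dir(\beta \mid \alpha, \tau^*)\, G(\tau^*)\, G(\tau)}{Dir(\beta \mid \alpha, \tau)\, G(\tau)\, G(\tau^*)}\right) \\
&= \min\left(1,\ \frac{Dir(\beta \mid \alpha, \tau^*)}{Dir(\beta \mid \alpha, \tau)}\right)
\end{aligned}
\tag{17}
$$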
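
The Metropolis-Hastings move of Equation 17 can be sketched as below. This is a hedged illustration, not the authors' code: candidates are drawn from a Gamma prior parameterised by shape and rate (an assumption), the target ratio reduces to a ratio of tempered Dirichlet densities evaluated at the current weights, and candidates that would make a Dirichlet parameter non-positive are rejected. All function names are hypothetical.

```python
import math
import random


def temper(params, tau):
    """Dirichlet concentrations under learning rate tau: a -> tau * (a - 1) + 1."""
    return [tau * (a - 1.0) + 1.0 for a in params]


def dirichlet_logpdf(weights, params):
    """Log-density of a Dirichlet distribution at a point on the simplex."""
    log_norm = math.lgamma(sum(params)) - sum(math.lgamma(a) for a in params)
    return log_norm + sum((a - 1.0) * math.log(w) for a, w in zip(params, weights))


def acceptance_probability(tau, tau_star, weights, base_params):
    """min(1, Dir(weights | tau*) / Dir(weights | tau)), as in the last line of Eq. 17."""
    proposed = temper(base_params, tau_star)
    if min(proposed) <= 0.0:
        return 0.0
    log_ratio = (dirichlet_logpdf(weights, proposed)
                 - dirichlet_logpdf(weights, temper(base_params, tau)))
    return math.exp(min(0.0, log_ratio))


def mh_step(tau, weights, base_params, shape, rate, rng):
    """One MH jump with the Gamma prior as an independence proposal."""
    tau_star = rng.gammavariate(shape, 1.0 / rate)
    if rng.random() < acceptance_probability(tau, tau_star, weights, base_params):
        return tau_star
    return tau


new_tau = mh_step(1.0, [0.5, 0.5], [2.0, 2.0], shape=2.0, rate=2.0,
                  rng=random.Random(0))
```

Because the proposal equals the prior over the learning rate, the prior and proposal terms cancel and only the Dirichlet ratio remains, which is what `acceptance_probability` computes.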


3.2 Discussion on the learning rates 

As per the above sections, the learning rates for each parameter are inferred separately to allow more
degrees of freedom for independent adaptation of each parameter. The empirical results support this, as
each of the learning rates ($\tau_\Sigma$, $\tau_\mu$, $\tau_\beta$, $\tau_\pi$) can adapt differently
for the same sequence of data, depending on the complexity of the data and the degree of evolution in the
emissions and state transitions. Nevertheless, their impact pattern on the mean and covariance of the
respective posterior distributions tends to be similar. As shown for $\mu$ (Equation 13), the learning
rate does not change the mean, but inversely scales the covariance (see Appendix D for more details).
Accordingly, for all cases when $0 < \tau < 1$ the posterior distribution is more driven by the current
observations. However, for $\tau > 1$ the inferred parameters follow the prior distribution more closely.
In the following experiments, the dynamics of $\tau$ with respect to the data are explored more
extensively.


4 Experiments 

The experiments aim to explore the effectiveness of the proposed AdOn HDP-HMM for segmentation and
classification in a variety of scenarios. To closely examine the adaptability of the model, we have
designed several synthetic datasets with stationary and evolutionary distributions. These also allow us
to investigate the effects of using learning rates in enhancing adaptability, where an adaptive learning
rate is noted concisely as 'ada $\tau$' and the basic alternative with fixed learning rate is shown as
$\tau = 1$. Then, with two video datasets, we demonstrate the performance of the proposed model in
various challenging sequences with noisy data, abrupt changes and new classes in the test data. It is
important to mention that the degree of challenge in the synthetic experiments is not easily comparable
to the video data, due to differences in the nature of the signals, noise and, most importantly, degree
of evolution, which is stronger by design in the synthetic data. Hence, analysing both categories of
experiments can shed more light on the adaptability of AdOn HDP-HMM in various contexts.

To evaluate the results more comprehensively, we introduce metrics for both classification and time
segmentation performance. For classification accuracy, we use a frame-level comparison of the decoded
classes with the ground truth (based on Hamming distance). To evaluate time segmentation, the standard
metrics of precision and recall are used to indicate the accuracy of detecting boundaries between segments.
A true boundary is regarded as correctly detected if a change of state is decoded within an interval
of ±Δt frames from the ground truth location, where Δt is set to 10 percent of the average segment length.
Any additional detected boundaries are counted as false positives. We also report the difference between
the overall number of actions detected in the test sequence and the number of actions in the ground truth
(noted as cardinality, with an ideal value of zero).
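
The metrics above can be sketched in a few lines of Python. This is only an illustration: the function names, the greedy one-to-one matching of decoded to true boundaries, and the reading of cardinality as the number of distinct decoded labels are assumptions made for the sketch, not details given in the text.

```python
from typing import List, Sequence, Tuple


def frame_accuracy(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of frames whose decoded label equals the ground truth."""
    assert len(pred) == len(truth) and len(truth) > 0
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def boundaries(labels: Sequence[int]) -> List[int]:
    """Frame indices at which the label differs from the previous frame."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]


def tolerance(truth: Sequence[int]) -> float:
    """Delta t: 10 percent of the average ground-truth segment length."""
    n_segments = len(boundaries(truth)) + 1
    return 0.1 * len(truth) / n_segments


def boundary_precision_recall(pred, truth, tol) -> Tuple[float, float]:
    """Precision and recall of decoded boundaries within +/- tol frames."""
    pred_b, true_b = boundaries(pred), boundaries(truth)
    unused = set(pred_b)
    hits = 0
    for b in true_b:
        # Assumption: each decoded boundary can match one true boundary only.
        candidates = [p for p in unused if abs(p - b) <= tol]
        if candidates:
            unused.remove(min(candidates, key=lambda p: abs(p - b)))
            hits += 1
    precision = hits / len(pred_b) if pred_b else 1.0
    recall = hits / len(true_b) if true_b else 1.0
    return precision, recall


def cardinality_difference(pred, truth) -> int:
    """Decoded minus true number of distinct labels (ideal value: zero)."""
    return len(set(pred)) - len(set(truth))


# Toy example (hypothetical labels): one late boundary, one spurious class.
truth = [0] * 20 + [1] * 20 + [2] * 20
pred = [0] * 21 + [1] * 19 + [3] * 5 + [2] * 15
precision, recall = boundary_precision_recall(pred, truth, tolerance(truth))
print(frame_accuracy(pred, truth), precision, recall,
      cardinality_difference(pred, truth))
```

In the toy example the spurious class creates an extra boundary, which lowers precision but not recall, and yields a cardinality difference of one.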

The empirical results are quantitatively reported in tables and visualised in colour plots of ground-truth
vs. estimated labels. In each illustration (for instance, Figure 6), the horizontal axis is time and the
estimated labels are plotted on top of the true labels, providing a qualitative measure of the segmentation
and classification performance. These plots are best viewed in colour.

4.1 Synthetic data 

The basic framework of the synthetic dataset is generated from a univariate HMM with 5 states distributed
around dispersed means (μ = [100, 200, 300, 400, 500]) with unit variance and a Dirichlet-distributed
transition matrix (α = [3, 3, 3, 3, 3]). This generative model is similar to the AdOn HDP-HMM, but not an
exact replica, due to the absence of the HDP prior and of the adaptation of r in the generative process
(please refer to the figures of the two models for comparison).
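
A minimal sketch of such a generator is given below, using NumPy. The uniform choice of the initial state, the random seed and the function name are assumptions for illustration; the text only fixes the means, the unit variance and the Dirichlet parameter of the transition rows.

```python
import numpy as np


def sample_hmm(length, means, sigma, alpha, rng):
    """Draw (states, observations, transition matrix) from a univariate Gaussian HMM."""
    k = len(means)
    # One Dirichlet draw per row gives a random row-stochastic transition matrix.
    trans = rng.dirichlet(alpha, size=k)
    states = np.empty(length, dtype=int)
    states[0] = rng.integers(k)  # assumption: uniform initial state
    for t in range(1, length):
        states[t] = rng.choice(k, p=trans[states[t - 1]])
    # Each observation is Gaussian around the mean of its state.
    obs = rng.normal(loc=np.asarray(means, dtype=float)[states], scale=sigma)
    return states, obs, trans


rng = np.random.default_rng(0)
means = [100, 200, 300, 400, 500]
states, obs, trans = sample_hmm(100, means, sigma=1.0,
                                alpha=[3, 3, 3, 3, 3], rng=rng)
print(trans.round(2))
print(states[:20])
```

Raising `sigma` to 50 in the call above gives the heavily overlapping states used in the noisy stationary experiment.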

4.1.1 Stationary distributions 

Given the above basic configuration, the stationary experiments are run over 3 sequences of length 100, 
trained using leave-one-out cross validation. Hence, the distributions of training and test samples are the 
same. The test sequence is split into batches with approximate size of 16 time units. To provide adaptation, 
the inferred parameters of each batch are propagated into the next batch as priors. 

The proposed Adaptive Online HDP-HMM is able to recognise and segment this basic version with 100
percent accuracy, whether or not the learning rates are used. To probe the model further, we add significant
noise to the above model by increasing the standard deviation to 50, thereby causing a considerable overlap
between the distributions of the states (Figure 5). Despite this substantial noise, the model remains
accurate, with an average frame-level accuracy of 76.3 percent. Repeating this experiment on the same data
with fixed learning rates (r = 1) shows a decline in accuracy of 3 percentage points and
undesirable extra states. Table 1, first two rows, reports the detailed figures for accuracy, recall,
precision and the number of inferred states.

4.1.2 Evolutionary distributions 

A more advanced experiment is designed by training the model on synthetic data with evolving distributions,
either involving gradual shifts to the means of each class or including new unseen classes. The
standard deviation for this experiment is set to σ = 10.

Figure 5: (a) The distribution of states in the noisy synthetic data set. Note that due to the large standard
deviation (σ = 50) there is a significant overlap between the states, making recognition a challenging
task. (b) The sum of density functions for all states.



Experiment                             Accuracy   Recall   Precision   Cardinality
Stationary, Noisy (ada r)              0.76       0.92     0.92        0.33
Stationary, Noisy (r = 1)              0.73       0.89     0.93        1.7
Evolutionary, shifting mean (ada r)    0.97       0.97     0.99        0
Evolutionary, shifting mean (r = 1)    0.71       0.99     1           1
Evolutionary, new class (ada r)        1.00       1.00     1.00        0
Evolutionary, new class (r = 1)        0.86       0.86     0.98        0
Evolutionary, combined (ada r)         0.93       1.00     1.00        1
Evolutionary, combined (r = 1)         0.81       0.95     0.97        2

Table 1: Frame-level accuracy, segmentation recall, precision and state cardinality error for the synthetic
experiments. Each pair of rows reports the results of one experiment section, comparing the performance
with and without the learning rate: i) Stationary with high noise, reporting average results on 3
sequences, ii) Evolutionary with shifting means, iii) Evolutionary with new states, iv) Evolutionary with
combined shifts and new states.



Figure 6: Segmentation and classification results for evolutionary synthetic data, using fixed (r = 1) and
adaptive (ada r) learning rates. Top half of the stripes: predicted states; bottom half: ground truth. (a,d)
Shifting means: Without the adapting effects of the learning rate, shifting means can be misclassified as
new classes (yellow) in d. (b,e) New class: shown in yellow in the ground truth. The new class has
been recognised and learnt in both cases, assigned a random colour (orange in (b) and red in (e)). (c,f)
Combined: Adding both challenges causes a slight decrease in accuracy and an extra new class. Nevertheless,
the considerable performance gap in utilising r is still visible.

Shifting class means: To examine the adaptability of the model, we drift the class means by 0.5 units
at each time step. Therefore, an instance appearing at t = 10 in the test sequence is generated from a
distribution with its mean shifted by 5 units. For a non-adaptive model and given the synthetic generation
scheme, such data can cause significant classification errors after a few tens of time units. However, the
results of the Adaptive Online HDP-HMM demonstrate smooth adaptation and excellent accuracy over the
evolving sequence (Figure 6a). There are a few misclassifications towards the end of the sequence, which
are due to the heavy distributional drift. Compared with these results, fixed learning rates lead to
a significant drop of 26 percentage points in accuracy and one undesirable new class (see Table 1,
section ii).
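
To give a feel for how quickly such a drift hurts a model that does not adapt, the sketch below computes the error rate of a fixed decision rule on a drifting interior class. The fixed rule (decision boundaries halfway between the original means, i.e. 50 units on either side) is a stand-in chosen for this illustration and is not the baseline used in the experiments; σ = 10 and the drift of 0.5 per step follow the text.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def static_error_rate(t, drift=0.5, sigma=10.0, half_gap=50.0):
    """Misclassification probability at time t for an interior class.

    The class mean has moved by drift * t, while the decision boundaries
    stay at +/- half_gap around the original mean (assumed static rule).
    """
    shift = drift * t
    above = 1.0 - normal_cdf((half_gap - shift) / sigma)
    below = normal_cdf((-half_gap - shift) / sigma)
    return above + below


for t in (0, 10, 40, 60, 80, 100):
    print(f"t = {t:3d}  shift = {0.5 * t:5.1f}  error = {static_error_rate(t):.4f}")
```

Under these assumptions the error is negligible at t = 10, reaches a few percent around t = 60 and approaches one half at t = 100, when the drifted mean sits exactly on the old boundary.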

New classes: In this experiment, distributions do not shift, yet one new class appears around μ = 600
with the same σ as the other classes. The model is able to create a new state (shown with a random new
colour in Figure 6b,e), learn it and consistently recognise it in the later batches without distorting the
parameters of the existing classes. The overall accuracy of 100 percent for this experiment is mostly thanks
to the contribution of the learning rate in adjusting the variances of each class with respect to the degree
of adaptation. Not using learning rates reduces accuracy considerably (by 14 percentage points) due to drift
in the existing classes (Table 1), also exhibiting one extra class and reduced recall and precision.

Combination of the two: Combining the above two evolutionary scenarios, we test the proposed 
model on a sequence with a new class that needs to be distinguished among the existing shifting classes. 
The challenge is two-fold: i) the shifting modes are prone to being misclassified as new classes, and ii) 
the new class might be merged into one of the existing shifted modes. This experiment is the closest to 
challenging real world scenarios where new states are likely to appear while the distributions can change 
over time. Given the combined challenge, the AdOn HDP-HMM proves highly accurate (93 percent), 
exhibiting a considerable improvement on the accuracy (12 percentage points) and cardinality of states 
thanks to the learning rate mechanism. 

The performance of the Adaptive Online HDP-HMM is not perturbed by these challenges because the
learning rate tunes the adaptability of the parameters with respect to the observed data. In an evolutionary
scenario, the likelihood of the observations given the current parameters is low. This causes the posterior
covariance learning rate (r_Σ) to increase, keeping the variance close to its prior. This, in turn, prevents a
drift of the variance towards large values and allows the mean to evolve. The concentration of the mean
learning rate around zero provides empirical support for this claim (see Figure 9).

In the absence of the learning rate, the model still learns and recognises the new state thanks to the
properties of the HDP. However, the overall performance deteriorates. On the one hand, new undesirable
classes appear in response to drift. On the other hand, some of the existing classes collapse into a single
one, due to a considerable increase of variance caused by the class shifts. This rigid increase in variance
does not allow the means to evolve, ultimately forcing the model to merge some of the neighboring states
into a single class with a large variance (Figure 6d,f).

4.2 Activity recognition datasets 

In this section, we use two video datasets to assess the performance of the proposed model in activity 
recognition scenarios. 

4.2.1 Collated Weizmann dataset 

The Weizmann dataset contains 93 single-action videos from a set of 10 classes performed by 9 different
actors. While the recognition accuracy on the original dataset is saturated [35], [36], some studies have
collated its individual actions into (unsegmented) sequences to experiment with time segmentation. In a
similar way, we have created 4 sequences, each consisting of 12 random actions selected from the provided
action classes. Each sequence consists of approximately 900 frames. As the feature set, we have used the
position of the actor’s centroid in the image plane and the distances between the centroid and the actor’s
contour along five given directions [37].

                          Accuracy                  F1 score                  Cardinality
Method                    S1    S2    S3    S4      S1    S2    S3    S4      S1    S2    S3    S4
Online HDP-HMM (ada r)    0.82  0.76  0.89  0.81    0.92  0.66  0.95  0.80    0     0     0     0
Online HDP-HMM (r = 1)    0.81  0.70  0.95  0.80    0.92  0.66  0.89  0.80    0     -1    0     -1
Offline HDP-HMM           0.78  0.76  0.95  0.81    0.91  0.66  0.95  0.81    0     0     0     0
Offline Max-margin        0.87 (avg)                -                         -

Table 2: Frame-level accuracy, segmentation F1 score, and difference in decoded state cardinality for the
adaptive online HDP-HMM variants and state-of-the-art studies on the collated Weizmann dataset.



(a) Sequence 1 (b) Sequence 2 



(c) Sequence 3 (d) Sequence 4 


Figure 7: Estimated states for Weizmann dataset. Action labels are represented by colours. 


The estimated states of the AdOn HDP-HMM variants over the above sequences are visualised in
Figure 7, showing good qualitative accuracy in segmentation and classification. The quantitative
results are reported in Table 2, including an Offline variant representing the experiment with a single batch
including the whole test sequence. This variant is run for the sake of comparison with a similar offline
max-margin study. However, the results are not directly comparable for two reasons: a) the datasets
are similar in conception, yet different in sequence collation, and b) the classifier in that study operates
over a closed set of classes, as opposed to ours, which allows an unlimited number of classes. The results
with the fixed learning rate (r = 1) show a similar trend to the adaptive one, with only a slightly lower
average accuracy. This can be due to the stationary nature of the dataset, as training and test sequences are
drawn from similar distributions and adaptation is not significant. In addition, the accuracy with the online
processing does not show any noticeable deterioration over the full, offline processing.

4.2.2 TUM kitchen dataset 

The TUM kitchen dataset is a human assistive dataset, consisting of natural unsegmented sequences of
everyday activities performed in a typical kitchen environment. The dataset contains multi-modal data,
annotated separately for the actors’ left and right hands (9 classes) and torso (2 classes). The features are
28D vectors of joint coordinates for the torso and the relevant hands. The main actions include ‘Reaching’,
‘Releasing Grasp Of Something’, ‘Taking An Object’, ‘Reaching Upward’, ‘Lowering An Object’, opening
and closing doors and drawers and ‘Carrying While Locomoting’, the distinctions between which are quite
subtle at times, even for human annotators. The main advantage of this dataset over the collated Weizmann
is that the transitions between actions occur naturally and the boundaries are vague even to human
annotators, hence time segmentation is more challenging.

In our experiments, we have performed segmentation and classification on the actions of the left and
right hands separately. All the sequences provided by the 3D motion capture sensors are used in
leave-one-out cross validation tests. Experiments are run for both the typical sequences (denoted as
'robotic', taking objects one by one) and the more challenging ones ('complex', including sequences with
multiple objects moved together, in arbitrary order and repeatedly).

For a general study of performance, we run an experiment on all the above sequences, involving both
the robotic and complex ones. The difference between them is in state transition probabilities, height and
size of actors and frequencies of action occurrence. The experiment is repeated with fixed and adaptive
learning rates and the results are compared in Table 3, generally showing a close match in frame-level
accuracy. The comparison between similar sequences with fixed vs. adaptive learning rate shows a minor
improvement in frame-level accuracy and a significant decrease in state cardinality error. Note that the
figures under cardinality show differences between inferred vs. actual number of states. To facilitate visual
evaluation, 4 of the sequences are colour-plotted in Figure 8. It is worth noting that classes in this dataset
may prove hard to segment. For instance, the distinction between putting an object on the table and leaving
grasp of it can be very subtle (the back-to-back lavender-blue and light-blue colours in Figure 8). This
becomes more challenging when a model has extra degrees of freedom for deriving a dynamic number of
classes, and explains the negative cardinality in the results.

To specifically observe the adaptive behaviour, we have trained the model on the robotic sequences and
tested it on complex ones. Although the emission parameters might not radically change in this scenario,
the transition probabilities need to adapt due to changes in the order of actions in the complex set. Table 4
shows the contribution of the learning rate, mainly in cardinality and overall accuracy. Similar to the
synthetic results, in the presence of learning rates the model is able to prevent an excessive increase of
the variance and to prevent neighboring classes from collapsing into one (a phenomenon that can be observed
when r = 1 in Figure 8).

To evaluate the ability to recognise new classes, we have taken the first 4 sequences and removed
the observations related to ‘Lowering an object’ (shown in lavender-blue in Figure 8d) in all but the first
sequence. We have then trained the model on sequences 2-4 and tested on the sequence containing the new
action. AdOn HDP-HMM is able to recognise a new action (brown in Figure 8d) and learn its parameters
with consistent future recognition. This significant property of the model is inherent to the HDP approach
and the behaviour is similar, irrespective of whether or not the learning rate is utilised.

The closest study on the TUM kitchen dataset leverages a CRF. This method is not directly comparable
to ours since AdOn HDP-HMM is online, adaptive and with a dynamic class set. To create a
closer match, we have run the Offline variant of AdOn HDP-HMM, the results of which are similar to
the CRF and outperform it for complex sequences. This finding aligns with our principal claim that
adaptability leads to remarkable improvements when the test distributions are different from the training.
The distributions of the transition-related learning rates for these experiments are mainly peaked
around 0.1, indicating that the learning rates encourage the model to rely on the observed data to infer the
HDP transition probabilities, which translates into more adaptability.

4.3 Sampling efficiency and computational time 

We next examine the Gibbs sampler’s mixing rate and execution time for the above experiments. To gain an
overall understanding of parameter mixing (emission and transition), the log-likelihood is shown in Figure 9e.
Since most of the sampled variables contribute to the likelihood calculation, the well-mixed results indicate
general mixing efficiency in the model. Additionally, mixing trends of the learning rates
for a generic evolutionary run are shown in Figure 9a-d, both to monitor mixing and to support the
experiments’ discussion. The large values of the covariance learning rate prevent the model from immediately
increasing the variance to fit the changing distributions. Rather, the model allows the means to evolve, by
converging to small values of the mean learning rate. The similarly small values of the transition-related
learning rates ensure adaptability of the model towards changing
state transitions for HDP-HMM. Through the orchestration of these parameters, the proposed model can

                    Accuracy                        Cardinality
                    RH              LH              RH              LH
Sequences           ada r   r = 1   ada r   r = 1   ada r   r = 1   ada r   r = 1
Online Seq 0-0      0.79    0.81    0.73    0.71    0       -1      -1      -1
Online Seq 0-1      0.79    0.82    0.75    0.75    -2      -1      0       -1
Online Seq 0-2      0.76    0.70    0.78    0.75    -2      -1      -1      -1
Online Seq 0-3      0.84    0.84    0.67    0.69    0       -2      0       -1
Online Seq 0-4      0.70    0.69    0.71    0.72    1       -1      -2      -3
Online Seq 0-6      0.51    0.48    0.56    0.55    -3      -6      -1      -3
Online Seq 0-7      0.45    0.48    0.57    0.55    -3      -4      -1      -3
Online Seq 0-8      0.64    0.68    0.62    0.63    -1      -3      -2      -2
Online Seq 0-9      0.73    0.71    0.70    0.69    0       -2      -1      -2
Online Seq 0-10     0.79    0.79    0.68    0.70    0       -2      0       -1
Online Seq 0-11     0.70    0.76    0.63    0.63    -5      -4      -2      -3
Online Seq 0-12     0.64    0.64    0.58    0.55    -1      -2      -3      -5
Online Seq 1-0      0.65    0.69    0.68    0.69    -1      -2      -3      -4
Online Seq 1-1      0.71    0.69    0.65    0.62    -1      -1      -2      -3
Online Seq 1-2      0.63    0.63    0.76    0.74    -1      -1      0       -1
Online Seq 1-3      0.14    0.14    0.66    0.65    -6      -6      0       0
Online Seq 1-4      0.64    0.67    0.74    0.71    -4      -5      0       -1
Online Seq 1-5      0.67    0.67    0.61    0.61    0       0       -2      -2
Online Seq 1-7      0.69    0.68    0.60    0.58    -1      -1      -1      -2

robotic sequences
Avg Online          0.80    0.79    0.73    0.73    1.00    1.25    0.5     1.00
Avg Offline         0.80    0.81    0.74    0.73    1.00    1.50    1.00    1.10
Avg Offline [ref]   [illegible] (avg)

complex sequences
Avg Online          0.66    0.66    0.67    0.66    1.68    2.37    1.16    2.26
Avg Offline         0.66    0.66    0.67    0.66    1.48    2.28    1.23    2.26
Avg Offline [ref]   [illegible] (avg)

Table 3: Frame-level accuracy and state cardinality error for Adaptive Online HDP-HMM on all TUM 
kitchen sequences. The comparison between similar sequences with fixed and adaptive learning rate (r) 
shows incremental improvement on frame-level accuracy and significant decrease on state cardinality error. 
Note that the figures under cardinality show differences between inferred vs. actual number of states, 
considering the sign. However the absolute values are utilised to calculate the average cardinality error. 
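
The averaging rule in the caption can be illustrated with a few lines of Python. The helper name is hypothetical; the signed values are those of the first four rows of Table 3 (Seq 0-0 to Seq 0-3), whose mean absolute errors coincide with the ‘Avg Online’ row reported for the robotic sequences.

```python
def mean_abs_cardinality_error(signed_errors):
    """Average cardinality error computed on absolute values, as in the caption."""
    return sum(abs(e) for e in signed_errors) / len(signed_errors)


# Signed cardinality differences for Seq 0-0 to Seq 0-3 in Table 3.
first_four = {
    "RH ada r": [0, -2, -2, 0],
    "RH r = 1": [-1, -1, -1, -2],
    "LH ada r": [-1, 0, -1, 0],
    "LH r = 1": [-1, -1, -1, -1],
}

for column, errors in first_four.items():
    signed_mean = sum(errors) / len(errors)
    print(f"{column}: mean |error| = {mean_abs_cardinality_error(errors):.2f}, "
          f"signed mean = {signed_mean:.2f}")
```

Averaging the signed values instead would let positive and negative differences cancel, which is why the absolute values are used.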
Figure 8: Estimated states for the TUM kitchen dataset using the Adaptive Online HDP-HMM. LH and
RH stand for left and right hand. The first two columns show robotic sequences, whereas the third column
includes complex ones. (c) is a human sequence with altered orders of actions performed spontaneously
and (d) contains a new action shown in lavender-blue in the ground truth and recognised by the model
in a random brown colour. In most cases, using adaptive r causes noticeable improvements in both the
performance and the cardinality of inferred states.



                              Accuracy                        Cardinality
                              RH              LH              RH              LH
Sequences                     ada r   r = 1   ada r   r = 1   ada r   r = 1   ada r   r = 1
Online Actor 1, complex       0.73    0.72    0.65    0.68    -2      -3      2       -2
Online ActorS, complex        0.55    0.54    0.52    0.49    -1      -3      -4      -6
Online Actor 1, repetitive    0.45    0.48    0.57    0.55    -3      -4      -1      -3

Table 4: Adaptability experiment: frame-level accuracy and state cardinality error for Adaptive Online
HDP-HMM trained with the robotic sequences and tested on the complex ones. The comparison between
similar sequences with/without learning rate (r) shows noticeable improvement in frame-level accuracy
and significant decrease in state cardinality error.

(e) Log-likelihood (f) Computational time


Figure 9: Sampling efficiency and computational time: (a-d) Sample mixing for all the 4 proposed learning
rates across three online batches and (e) log-likelihood plot for the first batch of a Weizmann run, showing
good mixing and convergence both for the learning rates and generally for all other parameters involved in
the likelihood calculation. (f) Computation time per frame (seconds) for the Weizmann and TUM kitchen
datasets, over the online and offline variants, with/without learning rates.


adapt to changes in the streaming batches with more exact account of the true cardinality of the classes and 
be immune from collapsing neighboring classes into a single one. 

Finally, the computational time per frame for runs on an Intel Xeon E5 2.90 GHz processor over
the Weizmann and TUM kitchen datasets is shown in Figure 9f. The boxplot includes online and offline
variants, with and without learning rates, to help explore how the learning rates and the online scheme
affect the computational time. Based on the elapsed time (in seconds), the offline run is the fastest
since all the data are processed in a single batch. The adaptive online runs occur in 3-4 batches of 1000
iterations each, and therefore show an increase of about 5-10 ms in completion time. Adapting the learning
rate can cause a delay of 3-10 ms, yet given the discussed benefits, particularly for evolving sequences,
this latency is quite reasonable. It is important to mention that, given the initial bootstrap training, the
Gibbs algorithm converges rapidly, allowing the model to run in acceptable time. Overall, using the learning
rate ensures multiple improvements without imposing excessive computational load on the system.


5 Conclusion 

In this paper, we have proposed a novel, adaptive online model suited for on-the-fly time segmentation
and recognition of sequential data. The proposed AdOn HDP-HMM is capable of online segmentation 
and classification of streaming batches of data over incremental class sets. The main contribution of this 
model is the unsupervised posterior adaptation of the parameters over the successive data batches. This 
is accomplished by using a learning rate that dynamically tunes the model balancing the impact of the 
current batch with the memory accumulated so far. This proves an effective solution for online sequential 
estimation problems requiring adaptation over evolving distributions. 

The performance of AdOn HDP-HMM is evaluated via a number of experiments including stationary 
and evolutionary scenarios. Thereby, we have tested the general segmentation and classification accuracies 
in addition to the ability to detect the correct number of classes. The results are reported on variations of
synthetic data and two activity recognition video datasets (Collated Weizmann and TUM Assistive Kitchen).
The proposed model has achieved a remarkable accuracy in all cases, and considerable improvements in 
evolutionary scenarios. 

Thanks to the unsupervised adaptive online estimation and the capacity to learn over infinite class sets, 
the proposed AdOn HDP-HMM can be a solution for sequential estimation in a number of scenarios 
which have received relatively little attention in the literature. Not relying on human intervention, revision 
or correction of estimated labels, this model can be a suitable candidate for streaming applications. In 
addition, although designed for evolutionary distributions, its accuracy over stationary data has proved 
higher than or equal to that of the most comparable results, and the computational load is not affected 
significantly. 

6 Appendix A: The balancing effect of r 

In this section we address posterior inference of parameters and explore how the prior and likelihood
distributions convey the knowledge of the current observations and the accumulated summary of previous data.
Considering the online HDP-HMM model with parameters φ, observations Y and learning rate r, the
posterior for the parameters in the n-th batch is:

\[
\begin{aligned}
p(\phi_n \mid Y_n, \phi_{n-1}, r) &\propto p(Y_n \mid \phi_{n-1})\, p(\phi_{n-1})^{r} \\
&\propto \underbrace{\prod_{i=1}^{N} p(y_{n,i} \mid \phi_{n-1})}_{N} \;
\underbrace{\Big( \prod_{m=1}^{n-1} \prod_{i=1}^{N} p(y_{m,i} \mid \phi_{n-1}) \Big)^{r}}_{N(n-1)r}
\end{aligned}
\tag{18}
\]

As more batches stream in (i.e. n increases), the weight of the prior accumulates and adaptivity to new
data declines. The learning rate, however, can be used as an equaliser that controls the balance of prior
versus likelihood and tunes model adaptivity. For positive values of r < 1, the model discounts the impact
of accumulated previous data and allows for more adaptivity. However, when r > 1, the posterior of φ_n is
inclined to follow the prior more strictly. In other words, r can be seen as the scaling coefficient for the
number of ‘pseudo-observations’ in the prior.
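
The pseudo-observation reading can be made concrete with a deliberately simplified example: a Gaussian mean with known variance, where the prior summarises n - 1 previous batches of N observations each. Raising that prior to the power r multiplies its pseudo-count N(n - 1) by r. The Gaussian simplification, the function name and the numbers below are assumptions for illustration only; they are not the full HDP-HMM posterior.

```python
def posterior_mean(prior_mean, batch_mean, batch_size, n_prev_batches, r):
    """Posterior mean of a Gaussian mean (known variance) with a tempered prior.

    The prior carries batch_size * n_prev_batches pseudo-observations;
    the learning rate r rescales that count.
    """
    pseudo = r * batch_size * n_prev_batches
    return (pseudo * prior_mean + batch_size * batch_mean) / (pseudo + batch_size)


# Hypothetical setting: 4 previous batches of 16 frames with mean 100,
# and a current batch of 16 frames whose mean has moved to 110.
for r in (0.1, 1.0, 2.0):
    print(f"r = {r}: posterior mean = {posterior_mean(100.0, 110.0, 16, 4, r):.2f}")
```

With r = 0.1 the estimate moves most of the way towards the current batch, with r = 1 it is the usual count-weighted average, and with r = 2 it stays closer to the prior.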


7 Appendix B: Conjugacy for r_Σ

To sample from the posterior, ideally we would like to consider a conjugate prior that analytically derives
the posterior hyper-parameters, given those of the prior and the sufficient statistics of the current data. A
candidate prior for the IW distribution is the Gamma. In this section, we investigate if the Gamma can be
proven a conjugate prior for the IW likelihood, considering the impact of the learning rates on Ψ' and ν'.

Given the proposed learning rate model, the probability density function of the Inverse-Wishart distribution
can be redefined as below. We have derived the new hyper-parameters through conversion to canonical
parameters, multiplication with the learning rate and reversion to the standard form to simplify sampling.




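\[
\mathcal{IW}(\Sigma \mid \Psi, \nu, r) =
\frac{|\Psi'|^{\frac{r\nu + c'}{2}}}{2^{\frac{(r\nu + c')p}{2}}\, \Gamma_p\!\left(\frac{r\nu + c'}{2}\right)}\,
|\Sigma|^{-\frac{r\nu + c' + p + 1}{2}}
\exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}(\Psi' \Sigma^{-1})\right),
\qquad \Psi' = r\Psi, \quad \nu' = r(\nu + p + 1) - p - 1 = r\nu + c'
\tag{19}
\]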


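We assume the Gamma is the conjugate prior distribution for sampling r, and try to prove it below.

\[
\begin{aligned}
G(r \mid \Sigma, \alpha^{*}, \beta^{*}) &\propto \mathcal{IW}(\Sigma \mid \Psi, \nu, r)\, G(r \mid \alpha, \beta) \\
&\propto \frac{|r\Psi|^{\frac{r\nu + c'}{2}}}{2^{\frac{(r\nu + c')p}{2}}\, \Gamma_p\!\left(\frac{r\nu + c'}{2}\right)}\,
|\Sigma|^{-\frac{r\nu + c' + p + 1}{2}}
\exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}(r\Psi\Sigma^{-1})\right)
\frac{\beta^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} \exp(-\beta r)
\end{aligned}
\tag{20}
\]

Thanks to proportionality, we can remove the terms that are constant with respect to r:

\[
G(r \mid \Sigma, \alpha^{*}, \beta^{*}) \propto
\frac{|r\Psi|^{\frac{r\nu + c'}{2}}}{2^{\frac{(r\nu + c')p}{2}}\, \Gamma_p\!\left(\frac{r\nu + c'}{2}\right)}\,
|\Sigma|^{-\frac{r\nu + c' + p + 1}{2}}
\exp\!\left(-\tfrac{r}{2}\,\mathrm{tr}(\Psi\Sigma^{-1})\right) r^{\alpha - 1} \exp(-\beta r)
\tag{21}
\]

Ideally, we should create terms proportional to $r^{\alpha^{*} - 1} \exp\!\left(-r\left(\tfrac{1}{2}\mathrm{tr}(\Psi\Sigma^{-1}) + \beta\right)\right)$,
but because r affects both hyper-parameters, the initial term related to the degrees of freedom (ν) is also
dependent on r:

\[
G(r \mid \Sigma, \alpha^{*}, \beta^{*}) \propto
\frac{|r\Psi|^{\frac{r\nu + c'}{2}}}{2^{\frac{(r\nu + c')p}{2}}\, \Gamma_p\!\left(\frac{r\nu + c'}{2}\right)}\,
|\Sigma|^{-\frac{r\nu + c' + p + 1}{2}}\,
r^{\alpha - 1} \exp\!\left(-r\left(\tfrac{1}{2}\,\mathrm{tr}(\Psi\Sigma^{-1}) + \beta\right)\right)
\tag{22}
\]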

To conclude, the Inverse-Wishart is only conjugate to the Gamma for the scale parameter (or a scaling
coefficient over parameter Ψ), and cannot be used as a conjugate prior for deriving the posterior distribution
of the covariance learning rate.


8 Appendix C: Conjugacy for r_μ

Let us consider a multivariate Normal distribution in a fully general case, with a conjugate Gamma
prior over the random variable r. We will show that conjugacy holds for this setting by expanding
the right-hand side of the proportionality and deriving the posterior hyper-parameters in the presence of a
single sample of data (A), i.e. N = 1. The resulting parameters can be easily extended to the
case of N observations.


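\[
\begin{aligned}
G(r \mid A, \alpha^{*}, \beta^{*}) &\propto \mathcal{N}\!\left(A \mid \mu, \tfrac{\Sigma}{r}\right) G(r \mid \alpha, \beta),
\qquad \alpha, r > 0 \\
&\propto r^{1/2} \exp\!\left(-\tfrac{r}{2}(A - \mu)^{T}\Sigma^{-1}(A - \mu)\right)
\times \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1} \exp(-\beta r)
\end{aligned}
\]

Discarding the terms that are independent of the random variable r, we will have:

\[
\begin{aligned}
&\propto r^{\alpha - 1/2} \exp\!\left(-\beta r - \tfrac{r}{2}(A - \mu)^{T}\Sigma^{-1}(A - \mu)\right) \\
&\propto r^{\alpha + 1/2 - 1} \exp\!\left(-r\left(\beta + \tfrac{1}{2}(A - \mu)^{T}\Sigma^{-1}(A - \mu)\right)\right)
\end{aligned}
\tag{23}
\]

The remaining terms are proportional to a Gamma distribution with the following parameters:

\[
\alpha^{*} = \alpha + 1/2, \qquad \beta^{*} = \beta + \tfrac{1}{2}(A - \mu)^{T}\Sigma^{-1}(A - \mu)
\tag{24}
\]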
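
The conjugacy result can be checked numerically in the univariate case, where the update α* = α + 1/2 is exact; for a p-dimensional observation the determinant of the scaled covariance would contribute a factor r^{p/2} instead of r^{1/2}. The sketch below verifies that the log of the joint density differs from the log of the claimed Gamma posterior only by a constant in r. The function names and the numerical values are arbitrary choices for illustration.

```python
import math


def log_gamma_pdf(r, a, b):
    """Log density of a Gamma(shape a, rate b) distribution at r > 0."""
    return a * math.log(b) - math.lgamma(a) + (a - 1.0) * math.log(r) - b * r


def log_normal_pdf(x, mean, var):
    """Log density of a univariate Normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def log_joint(r, x, mean, var, a, b):
    """Log of N(x | mean, var / r) * Gamma(r | a, b)."""
    return log_normal_pdf(x, mean, var / r) + log_gamma_pdf(r, a, b)


def posterior_params(x, mean, var, a, b):
    """Claimed Gamma posterior parameters for a single univariate sample."""
    return a + 0.5, b + (x - mean) ** 2 / (2.0 * var)


x, mean, var, a, b = 3.0, 1.0, 4.0, 2.0, 1.5
a_star, b_star = posterior_params(x, mean, var, a, b)
# If the posterior is Gamma(a_star, b_star), this difference is constant in r.
diffs = [log_joint(r, x, mean, var, a, b) - log_gamma_pdf(r, a_star, b_star)
         for r in (0.1, 0.5, 1.0, 2.0, 5.0)]
print(a_star, b_star, max(diffs) - min(diffs))
```

The printed spread of the differences is zero up to floating-point error, which is what conjugacy requires.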
9 Appendix D: Impacts of r on parameter distributions

In this appendix, we explore the impact of r on changing the mean and covariance of the Inverse-Wishart. 
As mentioned in the paper, approximately in all cases the mean stays unchanged and the variance is scaled 
inversely to the learning rate. 


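\[
\begin{aligned}
\mathcal{IW}(\Sigma \mid \Psi, \nu)^{r}
&\propto \left( |\Sigma|^{-\frac{\nu + p + 1}{2}} \exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}(\Psi\Sigma^{-1})\right) \right)^{r}
= |\Sigma|^{-\frac{r(\nu + p + 1)}{2}} \exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}(r\Psi\Sigma^{-1})\right) \\
&\approx |\Sigma|^{-\frac{r\nu + p + 1}{2}} \exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}(r\Psi\Sigma^{-1})\right)
\end{aligned}
\tag{25}
\]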


Accepting the approximation above, the resulting Σ samples are drawn approximately around the same
mean, but with a scaled variance. When 0 < r < 1 the variance increases, whereas for r > 1 the
distribution is more peaked. In other words, the posterior samples of Σ in the former case are allowed to
move away from the IW mean, tending to have greater adaptability towards the current observed data, but in
the latter case the posterior samples concentrate around the prior mean, discouraging covariance adaptation.
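
The effect can be illustrated in one dimension (p = 1), where IW(Ψ, ν) reduces to an inverse-gamma distribution with the standard closed-form moments: mean Ψ / (ν - 2) and variance 2Ψ² / ((ν - 2)² (ν - 4)) for ν > 4. The sketch below evaluates these moments for IW(rΨ, rν); the values of Ψ and ν are hypothetical and chosen only to make the trend visible.

```python
def inv_wishart_1d_moments(psi, nu):
    """Mean and variance of a one-dimensional Inverse-Wishart IW(psi, nu)."""
    if nu <= 4:
        raise ValueError("the variance is only finite for nu > 4 when p = 1")
    mean = psi / (nu - 2.0)
    var = 2.0 * psi ** 2 / ((nu - 2.0) ** 2 * (nu - 4.0))
    return mean, var


psi, nu = 500.0, 52.0  # hypothetical prior hyper-parameters
rates = (0.5, 1.0, 2.0, 10.0)
moments = {r: inv_wishart_1d_moments(r * psi, r * nu) for r in rates}
for r, (mean, var) in moments.items():
    print(f"r = {r:4.1f}: mean = {mean:6.3f}, variance = {var:6.3f}")
```

Across these rates the mean stays within a few percent of its value at r = 1, while the variance shrinks roughly in proportion to 1 / r.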


mean of Σ ~ IW(Ψ, ν):
mean of Σ ~ IW(rΨ, rν):

variance of Σ ~ IW(Ψ, ν):
variance of Σ ~ IW(rΨ, rν):


Me = 




ν + 1,
rν + p + 1